BUG: catch the case where w[0] is an IndirectObject instead of an int #2154

rchen19
Contributor

@rchen19 rchen19 commented Sep 5, 2023

Closes #2137
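
For readers unfamiliar with the failure mode, here is a minimal sketch of it. It does not reproduce the actual patch to pypdf/_cmap.py, which is not shown in this thread. `FakeIndirectObject`, `resolve`, and `first_entry_as_int` are hypothetical names made up for illustration; the only assumption about pypdf is that its `IndirectObject` exposes a `get_object()` method returning the referenced object.

```python
class FakeIndirectObject:
    """Hypothetical stand-in for a PDF indirect reference (not pypdf's class)."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Resolve the reference to the object it points at.
        return self._target


def resolve(value):
    """Return the referenced object if `value` looks like a reference, else `value`."""
    get_object = getattr(value, "get_object", None)
    return get_object() if callable(get_object) else value


def first_entry_as_int(w):
    """Read w[0] as an int, whether it is stored directly or behind a reference."""
    return int(resolve(w[0]))


# Using the unresolved reference where an integer is required fails, e.g.
# range(FakeIndirectObject(3)) raises:
#   TypeError: 'FakeIndirectObject' object cannot be interpreted as an integer
```

Resolving before use lets both forms of the array be handled by the same code path: `first_entry_as_int([FakeIndirectObject(32), 500])` and `first_entry_as_int([32, 500])` give the same result.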

@pubpub-zz
Collaborator

👍
Can you please also add a test that uses the file you are referencing and just extracts the text from page[0] (no need to assert any results)? That will prevent the issue from coming back.

@rchen19
Contributor Author

rchen19 commented Sep 5, 2023

👍 Can you please also add a test that uses the file you are referencing and just extracts the text from page[0] (no need to assert any results)? That will prevent the issue from coming back.

Will do. Would you like it inside an existing test file or a separate one?

And I assume the file goes to resources/, right?

@codecov

codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (0ca4d37) 94.34% compared to head (dc0c8b5) 94.34%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2154   +/-   ##
=======================================
  Coverage   94.34%   94.34%           
=======================================
  Files          43       43           
  Lines        7572     7572           
  Branches     1488     1488           
=======================================
  Hits         7144     7144           
  Misses        263      263           
  Partials      165      165           
Files Changed Coverage Δ
pypdf/_cmap.py 93.68% <100.00%> (ø)


@pubpub-zz
Collaborator

Adding it to test_cmap.py sounds good.

@stefan6419846
Collaborator

And I assume the file goes to resources/, right?

You will find some examples of files being downloaded on-the-fly inside the test code, see

pypdf/tests/test_writer.py

Lines 1040 to 1044 in 05f2a65

@pytest.mark.enable_socket()
def test_iss471():
    url = "https://github.com/py-pdf/pypdf/files/9139245/book.pdf"
    name = "book_471.pdf"
    reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))
for example. This especially holds true if you do not hold all required rights to the test files themselves and cannot comply with the terms in https://github.com/py-pdf/sample-files#licenses.

@rchen19
Contributor Author

rchen19 commented Sep 6, 2023

And I assume the file goes to resources/, right?

You will find some examples of files being downloaded on-the-fly inside the test code, see

pypdf/tests/test_writer.py

Lines 1040 to 1044 in 05f2a65

@pytest.mark.enable_socket()
def test_iss471():
    url = "https://github.com/py-pdf/pypdf/files/9139245/book.pdf"
    name = "book_471.pdf"
    reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))

for example. This especially holds true if you do not hold all required rights to the test files themselves and cannot comply with the terms in https://github.com/py-pdf/sample-files#licenses.

The file I need is from arXiv, so it should be pretty easy to get a direct link. Thanks for the pointers.

@pubpub-zz
Collaborator

You should use https://github.com/py-pdf/pypdf/files/12489914/Morris.et.al.-.2020.-.TextAttack.A.Framework.for.Adversarial.Attacks.Data.Augmentation.and.Adversarial.Training.in.NLP.pdf, which is already attached to the discussion.

- a pdf file from arxiv is included
- URL too long
- file name too long
- variable declared but not used

@rchen19
Contributor Author

rchen19 commented Sep 6, 2023

You should use https://github.com/py-pdf/pypdf/files/12489914/Morris.et.al.-.2020.-.TextAttack.A.Framework.for.Adversarial.Attacks.Data.Augmentation.and.Adversarial.Training.in.NLP.pdf, which is already attached to the discussion.

The URL is too long to pass the code style check, so I used the file directly from arXiv instead. If a file hosted on GitHub is preferred, I can shorten the file name and upload it again.
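
If the GitHub-hosted file is preferred, one common way to keep a long URL within a line-length limit is to split it across adjacent string literals, which Python joins at compile time. This is only a sketch of that option, not necessarily what was committed; the `NAME` value is a hypothetical shortened cache name.

```python
# Adjacent string literals inside parentheses are concatenated at compile
# time, so each source line stays short while the URL itself is unchanged.
URL = (
    "https://github.com/py-pdf/pypdf/files/12489914/"
    "Morris.et.al.-.2020.-.TextAttack.A.Framework.for."
    "Adversarial.Attacks.Data.Augmentation.and."
    "Adversarial.Training.in.NLP.pdf"
)

# Hypothetical short name for a locally cached copy of the download.
NAME = "textattack_2137.pdf"
```

The resulting `URL` string is identical to the single-line form, so it can be passed to a download helper unchanged.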

@MartinThoma MartinThoma merged commit 4657df5 into py-pdf:main Sep 10, 2023
@MartinThoma
Member

Well done, @rraval !

I've just merged it and will release soon. If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

MartinThoma added a commit that referenced this pull request Sep 10, 2023
## What's new

### Security (SEC)
-  Infinite recursion caused by IndirectObject clone (#2156)

### New Features (ENH)
-  Ease access to ViewerPreferences (#2144)

### Bug Fixes (BUG)
-  catch the case where w[0] is an IndirectObject instead of an int (#2154)
-  Cope with indirect objects in filters and remove deprecated code (#2177)
-  Cope with extra space (#2151)
-  Merge pages without resources (#2150)
-  getcontents() shall return None if contents is NullObject (#2161)
-  Fix conversion from 1 to LA (#2175)
-  Accept tabs in cmaps (#2174)

### Robustness (ROB)
-  Accept XYZ with no arguments (#2178)

[Full Changelog](3.15.5...3.16.0)
@rchen19
Contributor Author

rchen19 commented Sep 11, 2023

Well done, @rraval !

I've just merged it and will release soon. If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

Sure, that sounds nice. Thanks.

Development

Successfully merging this pull request may close these issues.

possible bug with error TypeError: 'IndirectObject' object cannot be interpreted as an integer
4 participants